159 research outputs found

    Examination of Genome Homogeneity in Prokaryotes Using Genomic Signatures

    BACKGROUND: DNA word frequencies, normalized for genomic AT content, are remarkably stable within prokaryotic genomes and are therefore said to reflect a "genomic signature." Genomic signatures can be used to classify organisms phylogenetically from arbitrarily sampled DNA. They can also be used to search for horizontally transferred DNA or DNA regions subjected to special selection forces. Thus, the stability of the genomic signature can be used as a measure of genomic homogeneity. The factors associated with the stability of genomic signatures are not known, and this motivated us to investigate further. We analyzed the intra-genomic variance of genomic signatures based on AT content normalization (0th-order Markov model) as well as genomic signatures normalized by smaller DNA words (1st- and 2nd-order Markov models) for 636 sequenced prokaryotic genomes. Regression models were fitted with intra-genomic signature variance as the response variable. The predictors were a set of factors representing genomic properties: genomic AT content, genome size, habitat, phylum, oxygen requirement, optimal growth temperature and oligonucleotide usage variance (OUV, a measure of oligonucleotide usage bias), measured as the variance between genomic tetranucleotide frequencies and Markov chain approximated tetranucleotide frequencies. PRINCIPAL FINDINGS: Regression analysis revealed that OUV was the most important factor (p<0.001) determining intra-genomic homogeneity as measured using genomic signatures. This means that the less random the oligonucleotide usage is, in the sense of higher OUV, the more homogeneous the genome is in terms of the genomic signature. The other factors influencing variance in the genomic signature (p<0.001) were genomic AT content, phylum and oxygen requirement. CONCLUSIONS: Genomic homogeneity in prokaryotes is intimately linked to genomic GC content, oligonucleotide usage bias (OUV) and aerobiosis, while oligonucleotide usage bias (OUV) is associated with genomic GC content, aerobiosis and habitat.
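
The base-composition (0th-order Markov) normalization described in this abstract can be sketched roughly as follows. This is an illustration only: the function names `observed_expected` and `usage_variance` are hypothetical, and the mean squared difference used here is an assumed stand-in for OUV, not necessarily the authors' exact formula.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def observed_expected(seq, k=4):
    """Return (observed, expected) k-mer frequencies for one DNA sequence.

    Expected frequencies come from a zero-order Markov model, i.e. the
    product of the single-base frequencies of the letters in each word.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in BASES)
    total_bases = sum(base_counts.values())
    base_freq = {b: base_counts[b] / total_bases for b in BASES}

    # Count overlapping k-mers, skipping windows with ambiguous bases.
    kmer_counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(b in BASES for b in word):
            kmer_counts[word] += 1
    n_words = sum(kmer_counts.values())

    observed, expected = {}, {}
    for word in map("".join, product(BASES, repeat=k)):
        observed[word] = kmer_counts[word] / n_words
        p = 1.0
        for b in word:
            p *= base_freq[b]
        expected[word] = p
    return observed, expected

def usage_variance(observed, expected):
    """Mean squared difference between observed and expected frequencies
    (an assumed, simplified proxy for oligonucleotide usage variance)."""
    return sum((observed[w] - expected[w]) ** 2 for w in observed) / len(observed)

obs, exp = observed_expected("ACGT" * 50)
print(usage_variance(obs, exp))
```

Computing the same quantity in sliding windows along a genome, and comparing windows with each other, would give a rough picture of the intra-genomic variance the abstract refers to.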

    Abundant Oligonucleotides Common to Most Bacteria

    BACKGROUND: Bacteria show a bias in their genomic oligonucleotide composition far beyond that dictated by G+C content. Patterns of over- and underrepresented oligonucleotides carry a phylogenetic signal and are thus diagnostic for individual species. Patterns of short oligomers have been investigated by multiple groups in large numbers of bacterial genomes. However, global distributions of the most highly overrepresented mid-sized oligomers have not been assessed across all prokaryotes to date. We surveyed overrepresented mid-length oligomers across all prokaryotes and normalised for base composition and embedded oligomers using zero and second order Markov models. PRINCIPAL FINDINGS: Here we report a presumably ancient set of oligomers conserved and overrepresented in nearly all branches of prokaryotic life, including Archaea. These oligomers are either adenine-rich homopurines with one to three guanine nucleosides, or homopyrimidines with one to four cytosine nucleosides. They do not show a consistent preference for coding or non-coding regions or aggregate in any coding frame, implying a role in DNA structure and as polypeptide binding sites. Structural parameters indicate these oligonucleotides to be an extreme and rigid form of B-DNA prone to forming triple-stranded helices under common physiological conditions. Moreover, the narrow minor grooves of these structures are recognised by DNA-binding and nucleoid-associated proteins such as HU. CONCLUSION: Homopurine and homopyrimidine oligomers exhibit distinct and unusual structural features and are present at high copy number in nearly all prokaryotic lineages. This fact suggests a non-neutral role of these oligonucleotides for bacterial genome organization that has been maintained throughout evolution.

    Frequent toggling between alternative amino acids is driven by selection in HIV-1

    Author Summary: Viruses, such as HIV, are able to evade host immune responses through escape mutations, yet sometimes they do so at a cost. This cost is the reduction in the ability of the virus to replicate, and thus selective pressure exists for a virus to revert to its original state in the absence of the host immune response that caused the initial escape mutation. This pattern of escape and reversion typically occurs when viruses are transmitted between individuals with different immune responses. We develop a phylogenetic model of immune escape and reversion and provide evidence that it outperforms existing models for the detection of selective pressure associated with host immune responses. Finally, we demonstrate that amino acid toggling is a pervasive process in HIV-1 evolution, such that many of the positions in the virus that evolve rapidly, under the influence of positive Darwinian selection, nonetheless display quite low sequence diversity. This highlights the limitations of HIV-1 evolution, and sites such as these are potentially good targets for HIV-1 vaccines.

    Benchmarking multi-rate codon models

    CITATION: Delport, W. et al. 2010. Benchmarking multi-rate codon models. PLoS ONE, 5(7): e11587, doi:10.1371/journal.pone.0011587. The original publication is available at http://journals.plos.org/plosone
    The single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content.
    © 2010 Delport et al. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0011587 Publisher's version

    Parallel and Convergent Evolution of the Dim-Light Vision Gene RH1 in Bats (Order: Chiroptera)

    Rhodopsin, encoded by the gene Rhodopsin (RH1), is extremely sensitive to light, and is responsible for dim-light vision. Bats are nocturnal mammals that inhabit poor light environments. Megabats (Old-World fruit bats) generally have well-developed eyes, while microbats (insectivorous bats) have developed echolocation and in general have degraded eyes; however, dramatic differences in the eyes, and in reliance on vision, exist within this group. In this study, we examined the rod opsin gene (RH1), and compared its evolution to that of two cone opsin genes (SWS1 and M/LWS). While phylogenetic reconstruction with the cone opsin genes SWS1 and M/LWS generated a species tree in accord with expectations, the RH1 gene tree united Pteropodidae (Old-World fruit bats) and Yangochiroptera, with very high bootstrap values, suggesting the possibility of convergent evolution. The hypothesis of convergent evolution was further supported when nonsynonymous sites or amino acid sequences were used to construct phylogenies. Reconstructed RH1 sequences at internal nodes of the bat species phylogeny showed that: (1) Old-World fruit bats share an amino acid change (S270G) with the tomb bat; (2) Miniopterus shares two amino acid changes (V104I, M183L) with Rhinolophoidea; (3) the amino acid replacement I123V occurred independently on four branches, and the replacements L99M, L266V and I286V each occurred on two branches. The multiple parallel amino acid replacements that occurred in the evolution of bat RH1 suggest the possibility of multiple convergences of their ecological specialization (i.e., various photic environments) during adaptation for the nocturnal lifestyle, and suggest that further attention is needed on the study of the ecology and behavior of bats.

    Robust inference of positive selection from recombining coding sequences.

    Motivation: Accurate detection of positive Darwinian selection can provide important insights to researchers investigating the evolution of pathogens. However, many pathogens (particularly viruses) undergo frequent recombination and the phylogenetic methods commonly applied to detect positive selection have been shown to give misleading results when applied to recombining sequences. We propose a method that makes maximum likelihood inference of positive selection robust to the presence of recombination. This is achieved by allowing tree topologies and branch lengths to change across detected recombination breakpoints. Further improvements are obtained by allowing synonymous substitution rates to vary across sites. Results: Using simulation we show that, even for extreme cases where recombination causes standard methods to reach false positive rates >90%, the proposed method decreases the false positive rate to acceptable levels while retaining high power. We applied the method to two HIV-1 datasets for which we have previously found that inference of positive selection is invalid owing to high rates of recombination. In one of these (env gene) we still detected positive selection using the proposed method, while in the other (gag gene) we found no significant evidence of positive selection. Availability: A HyPhy batch language implementation of the proposed methods and the HIV-1 datasets analysed are available at http://www.cbio.uct.ac.za/pub_support/bioinf06. The HyPhy package is available at http://www.hyphy.org, and it is planned that the proposed methods will be included in the next distribution. RDP2 is available at http://darwin.uvigo.es/rdp/rdp.html.
    This study was supported by the South African National Bioinformatics Network and by the National Institute of Allergy and Infectious Disease and the National Institutes of Health through the Centre for the AIDS Programme of Research in South Africa (grant no. 1U19AI51794). Funding to pay the Open Access publication charges was provided by the South African National Bioinformatics Network.

    Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

    Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, thereby yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments.
    Comment: 19 pages + 16 pages of supplementary material
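
The "concatenate and zip" idea mentioned in this abstract corresponds to the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. A minimal sketch, assuming zlib as the compressor (the abstract does not say which compressor is used) and with hypothetical function names:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share most of their information;
    values near 1 suggest they share little.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A general-purpose compressor such as zlib only sees matches within a limited window (32 KB), so this sketch is suitable for short sequences only; it does not include the additivity correction that the abstract proposes.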

    Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models

    Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators.
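
One way to see why stop codons matter for nucleotide-based codon frequency estimates is the sketch below. It assumes the universal genetic code and a single set of nucleotide frequencies shared by all three codon positions; the function names are hypothetical, and this illustrates the source of the bias rather than the corrected estimator proposed in the paper.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code (assumption)

def sense_codon_frequencies(nuc_freq):
    """Codon frequencies as products of nucleotide frequencies,
    renormalized over the 61 sense codons."""
    raw = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue
        p = 1.0
        for base in codon:
            p *= nuc_freq[base]
        raw[codon] = p
    total = sum(raw.values())
    return {codon: p / total for codon, p in raw.items()}

def implied_nucleotide_frequencies(codon_freq):
    """Nucleotide frequencies implied by a codon distribution,
    averaged over the three codon positions."""
    implied = dict.fromkeys("ACGT", 0.0)
    for codon, p in codon_freq.items():
        for base in codon:
            implied[base] += p / 3
    return implied

uniform = dict.fromkeys("ACGT", 0.25)
codon_freq = sense_codon_frequencies(uniform)
print(implied_nucleotide_frequencies(codon_freq))
```

With uniform input frequencies, the implied frequency of C rises above 0.25 and that of A falls below it, because none of the three stop codons contains C while A appears in them four times. The nucleotide composition implied by the model therefore no longer matches the frequencies that were plugged in.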